
Conversation

@neutrinoceros (Contributor)

This is a trial run, not intended for merging (at least not yet), as I'm curious to see exactly which test(s) is taking so long that CI requires about 2 hours.

@codecov-commenter commented Aug 1, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.19%. Comparing base (3373f93) to head (0c0cb83).
⚠️ Report is 25 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #677       +/-   ##
===========================================
- Coverage   62.28%   45.19%   -17.09%     
===========================================
  Files          11       87       +76     
  Lines         928    16835    +15907     
===========================================
+ Hits          578     7608     +7030     
- Misses        350     9227     +8877     


@neutrinoceros (Contributor, Author)

I completely missed that there were two separate calls to pytest, and of course I'm measuring the wrong one. I'll update when I'm back near a keyboard.

@neutrinoceros (Contributor, Author)

I'm only just noticing that .test_durations exists and that it's how pytest-split knows how to split the suite into "balanced" batches. Still, I'm curious to see whether this run turns up any prominent differences.
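For reference, here's a quick way to compare the recorded durations against this run. This is a sketch only; I'm assuming .test_durations is the flat JSON mapping of test node IDs to seconds that pytest-split maintains:

```python
# Sketch: list the slowest entries recorded in .test_durations so they can be
# compared against what this CI run reports.
# Assumption: the file is a flat JSON object of {node id: duration in seconds}.
import json

with open(".test_durations") as f:
    durations = json.load(f)

slowest = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:10]
for nodeid, seconds in slowest:
    print(f"{seconds:8.1f}s  {nodeid}")
```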

@neutrinoceros (Contributor, Author)

I see the longest-running tests take ~30 and ~15 minutes respectively. That's extremely long; individual tests usually complete well under a minute, so I'm puzzled.
@VChristiaens @IainHammond do you think it would be acceptable for a manual selection of tests (these two included) to be run only once in CI, as opposed to once per Python version? Also, could you give me a brief overview of what these tests are doing and why they're expected to run for that long?

@IainHammond (Contributor) commented Aug 2, 2025

> I see the longest-running tests take ~30 and ~15 minutes respectively. That's extremely long; individual tests usually complete well under a minute, so I'm puzzled. @VChristiaens @IainHammond do you think it would be acceptable for a manual selection of tests (these two included) to be run only once in CI, as opposed to once per Python version? Also, could you give me a brief overview of what these tests are doing and why they're expected to run for that long?

They do genuinely take this long and I'm aware it's far from ideal. I've been wanting to improve this for a while, and my approach has been to gradually speed up the functions themselves. This has had some degree of success since they used to run for almost 3 hours.

The tests spend the vast majority of their time in two functions, fm.firstguess and fm.mcmc_negfc_sampling. These are important functions, but there might be a way to make the tests less intensive while still simulating their real use. As for the Python versions, I'm open to ideas. They often reveal dependency issues and deprecated code, so testing each version has actually been helpful; just installing the packages and importing VIP doesn't always trigger an error. Do test suites usually run against just a single version of Python? One option is to run the full tests on 3.13 and have the earlier versions just install and import VIP to see if anything pops up?

As for what the tests are doing, I believe Valentin has some small example data that he provides to most of the functions several times to test each use case.
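To illustrate the second option above: the install-and-import job for the older Python versions could be as small as this sketch (assuming vip_hci is the import name and that a bare import is enough to surface dependency or deprecation issues):

```python
# Minimal "smoke test" sketch: the job for older Python versions would only
# install the package and check that importing it succeeds.
def test_import_vip():
    import vip_hci  # assumed import name for the VIP package

    assert vip_hci is not None
```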

@neutrinoceros (Contributor, Author)

Yes, running most of your tests against multiple Python versions is extremely valuable. What I'm proposing is to exclude a selection of the suite from the normal jobs and run that same subset against a single Python version, to save resources (and possibly some wall time as well).
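For concreteness, a minimal sketch of what I have in mind, assuming we go with a custom marker (the marker name "slow" and the test name below are placeholders, not existing VIP code):

```python
# Sketch: mark the hand-picked long-running tests with a custom marker,
# registered as "slow" in pytest.ini / pyproject.toml.
import pytest


@pytest.mark.slow  # placeholder marker name
def test_negfc_heavy_case():  # placeholder test name
    ...
```

The per-version matrix jobs would then run pytest -m "not slow", and a single job on one Python version would run pytest -m slow.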

@IainHammond (Contributor)

I'm okay with this, we'll see what Valentin thinks

@neutrinoceros (Contributor, Author)

Other than that, regarding performance, have you considered rewriting performance-critical loops in a low-level language (C, C++, Rust, or even Cython)?

@IainHammond (Contributor)

@neutrinoceros thanks for all of this! Just going to piggy-back off this most recent PR and ask: is there an order in which to merge each of these PRs? Or should we just accept the ones that are ready and leave the drafts?

@neutrinoceros (Contributor, Author) commented Aug 4, 2025

Yes, there's an order in which my PRs are intended to be reviewed and merged. I'm very careful to only undraft the ones that have no dependencies, so please prioritize everything explicitly marked as ready for review (as in, not drafted). If more than one is ready at any given point, they should be independent and order doesn't matter.

@VChristiaens (Contributor) commented Aug 14, 2025

Hi @neutrinoceros, thanks for raising this. As @IainHammond explained, we have put some effort into reducing the tests' computing time as much as possible, although we have started to hit a barrier, since we want to (i) keep them meaningful and (ii) cover as many parts of the code at the same time as possible (to minimize the overhead of partly re-running the same preliminary bits of code in different tests). That being said, I dived into the longest remaining tests, and I think there is a way to reduce the longest one by ~40% without reducing the coverage or the meaningfulness, as explained below. Thanks btw for highlighting these with the proposed change (which I think should also be fine to keep in the PR, unless you see a reason not to?).

In essence, for now the speckle_noise_uncertainty function, which estimates the uncertainties on planet parameters based on the injection of fake planets, is tested with tests/pre_3_10/test_fm_negfc_3d.py::test_algos[pca-firstguess-5-True-None-False-None], calling the pca function as the PSF subtraction algorithm. Using the medsub algorithm is faster, though; I just tried it on my side. So maybe this PR could be the opportunity to implement that. The only change involved is this line being replaced with if pca_algo == median_sub:

To further reduce the length of the 2 longest tests (test_fm_negfc_3d.py::test_algos[pca-firstguess-5-True-None-False-None] and test_fm_negfc_3d.py::test_algos[median_sub-firstguess-None-False-sum-False-None]), I think it would be fine to only run them with the latest Python version. I believe there should be enough redundancy in terms of dependent packages and functions in the other tests, in particular the other three of the same kind, to capture any version-related issue. So, green light on my side to implement that. Thanks a lot!
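For concreteness, one way this could look (a minimal sketch only, assuming 3.13 is the newest version in the CI matrix; not a commitment to this exact mechanism):

```python
# Sketch: restrict the two heaviest NEGFC parametrizations to the newest
# Python in the CI matrix via a skip marker.
import sys

import pytest

# Assumption: 3.13 is the latest version currently tested in CI.
heavy_on_latest_only = pytest.mark.skipif(
    sys.version_info < (3, 13),
    reason="heavy NEGFC cases are exercised on the latest Python only",
)
```

The two parameter sets above would then carry the marker, e.g. via pytest.param(..., marks=heavy_on_latest_only), while everything else keeps running on every version.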

@neutrinoceros (Contributor, Author)

> Thanks btw for highlighting these with the proposed change (which I think should also be fine to keep in the PR, unless you see a reason not to?).

That's okay with me! I initially used a pytest plugin to measure durations, but I didn't want to add another dependency long term. Now that I've realized pytest has this feature natively, I don't mind if you guys want to merge the change as is.

> So maybe this PR could be the opportunity to implement that. The only change involved is this line being replaced with if pca_algo == median_sub:

I'll give it a spin!

> we have put some effort into reducing the tests' computing time as much as possible, although we have started to hit a barrier, since we want to (i) keep them meaningful and (ii) cover as many parts of the code at the same time as possible (to minimize the overhead of partly re-running the same preliminary bits of code in different tests).

I get the noble intention, but I find this approach quite counter-productive in the long term. I'll try to explain my perspective here (and to keep it brief :)):
By privileging high coverage per test case, you're missing the opportunity to use tests as a form of design pressure. What I mean is that, going the other way and trying to write tests that each check as little as possible, one tends to design code that's intrinsically more modular and testable (so smaller bits can be tested independently). It also helps tremendously with covering the parameter space; the combinatorics of functions with more than a couple of arguments typically make that impossible with big end-to-end tests. As a side effect, aiming for smaller tests also tends to drastically reduce the duration of a full test session, to the point where one might run every test in a couple of seconds instead of minutes or hours, so it becomes possible to do so repeatedly and locally while developing, instead of once or twice when a branch reaches maturity.

Anyway, these are very general considerations; I'm not making a judgment on VIP specifically, nor have I studied it deeply enough to see where the low-hanging fruit would be.

@neutrinoceros neutrinoceros marked this pull request as ready for review August 15, 2025 10:38
@VChristiaens (Contributor)

Thanks for your explanations @neutrinoceros. I agree with the idea of testing as little as possible in each chunk. I think the test suite grew somewhat organically in VIP, and in some cases we saw the opportunity to incorporate very short bits of code leveraging the outputs of another tested algorithm, which involved less hassle than designing new tests.
That being said, at this stage I don't see a way to break the longest remaining test (an MCMC-based algorithm used to estimate uncertainties on the parameters of a fake planet) into smaller chunks, and it is the main limiting factor in the test suite, so I guess we'll have to live with it. Reducing the data or the number of iterations runs the risk of misestimating the uncertainties, which is the whole point of the test.
Since the longest remaining test is significantly longer than the others, I was wondering: is there a way to specify manually how to split the tests into different groups? If that is possible, the 2 longest remaining tests could each form their own group, and all the remaining ones could be bundled together.
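For illustration, roughly what I mean by manual grouping, in case it helps (a hypothetical conftest.py hook, not existing VIP code; the TEST_GROUP variable name is made up):

```python
# Hypothetical sketch: split the suite manually into groups, selected per CI
# job through a TEST_GROUP environment variable ("heavy" runs only the two
# long tests, "rest" runs everything else, unset runs the full suite).
import os

HEAVY = {
    "tests/pre_3_10/test_fm_negfc_3d.py::test_algos[pca-firstguess-5-True-None-False-None]",
    "tests/pre_3_10/test_fm_negfc_3d.py::test_algos[median_sub-firstguess-None-False-sum-False-None]",
}


def pytest_collection_modifyitems(config, items):
    group = os.environ.get("TEST_GROUP", "all")
    if group == "all":
        return
    want_heavy = group == "heavy"
    selected = [it for it in items if (it.nodeid in HEAVY) == want_heavy]
    deselected = [it for it in items if (it.nodeid in HEAVY) != want_heavy]
    if deselected:
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
```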
